Introduction Sickle cell disease (SCD) is a hereditary hemolytic anemia characterized by abnormal polymerization of intra-erythrocytic hemoglobin, chronic microvascular hemolysis, and vaso-occlusion. These underlying pathophysiologic processes lead to widespread organ dysfunction over time. Chronic kidney disease (CKD) is a common long-term outcome of SCD, primarily resulting from medullary ischemia, reduced urine concentrating ability, hyperfiltration, albuminuria, and decline in glomerular filtration rate. Nearly 30% of adults with SCD develop CKD. Progression to end-stage renal disease among individuals with SCD carries a 26% one-year mortality. Risk factors for CKD identified in small retrospective cohort studies include older age, high systolic blood pressure, albuminuria, and homozygous SS and sickle-β⁰-thalassemia genotypes, but the ability to identify at-risk patients and intervene early remains limited.

Here, we describe a first-of-its-kind application of explainable artificial intelligence (XAI) to the ASH Research Collaborative (ASH RC) Data Hub, yielding a highly sensitive CKD prediction model that generates insights into factors driving CKD risk in patients with SCD.

Methods We applied a previously validated electronic health record (EHR) phenotyping algorithm for CKD to all physician-attested SCD patients in the ASH RC Data Hub, a large, multi-site EHR database totaling 27,318 patients (13,284 with physician-attested SCD diagnoses) from 17 hospital systems. For all patients in the cohort, we retrieved a set of 31 clinical features from the ASH RC Data Hub, encompassing demographics, common CKD comorbidities, lab measurements, and composite features representing healthcare utilization. We then trained random forest classifiers to predict whether a patient has progressed to CKD and applied Shapley Additive exPlanations (SHAP) to the trained model to describe which observations are predictive of increased risk for progression to CKD.
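
To illustrate the attribution idea behind SHAP, the following minimal sketch computes exact Shapley values for a single prediction by enumerating feature coalitions. It is not the study's pipeline: the scoring function `toy_risk`, its coefficients, the three feature names, and the patient and baseline values are all hypothetical, and a hand-written score stands in for the random forest. In practice, SHAP implementations use faster algorithms than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value,
    so each value is a feature's weighted average marginal contribution.
    """
    n = len(x)
    features = list(range(n))

    def value(coalition):
        # Evaluate the model with only the coalition's features "switched on".
        z = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(z)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical risk score; coefficients are illustrative only."""
    age, min_hgb, n_creatinine = z
    score = 0.01 * age - 0.03 * min_hgb + 0.02 * n_creatinine
    if age > 40 and min_hgb < 8:
        score += 0.1  # interaction between age and low hemoglobin
    return score


# Hypothetical patient and reference ("average") patient:
# [age in years, minimum hemoglobin in g/dL, number of creatinine measurements]
patient = [55, 6.5, 12]
baseline = [30, 9.0, 4]

phi = shapley_values(toy_risk, patient, baseline)
for name, contribution in zip(["age", "min_hgb", "n_creatinine"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that lets per-feature attributions be aggregated into a ranking of risk factors.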

Results The CKD phenotyping algorithm applied to the 13,284 verified SCD patients in the ASH RC Data Hub identified 2,767 CKD cases and 959 non-CKD controls (the remaining patients had ambiguous CKD status). The random forest classifier for CKD showed high accuracy, correctly classifying 85% of patients with known CKD status (F1-score: 0.80; AUROC: 0.93). The strongest risk factors identified by SHAP were (in descending order) increased age, female sex, lower minimum hemoglobin value, having a urinalysis within 24 months of an eGFR result, and total number of creatinine measurements. Number of inpatient encounters and number of ED encounters were also highly ranked (ranks 8 and 10), showing that certain healthcare utilization metrics can predict progression to CKD.

Discussion We linked two data science tasks – algorithmic phenotyping of a CKD cohort and subsequent training of a machine learning-based risk prediction model – to gain insight into patterns of progression to CKD in patients with SCD. This addresses a major challenge in complex disease research: a lack of statistical power stemming from restriction to a single clinical dataset and the need for manual chart review to establish cases and controls. While the strong performance of the prediction model itself is valuable for both clinical and research applications, the SHAP explainability results convey knowledge that may fundamentally increase our understanding of CKD etiology itself. Of note, features describing healthcare utilization (e.g., number of ED or inpatient encounters, or number of a specific lab measurement) are ranked higher than many of the standard diagnostic features used by clinicians. The fact that we can rank these features' relative importance and use them in large-scale algorithmic systems has important implications for both early detection and treatment of disease.
This study also serves as an important proof-of-concept for larger-scale data-driven AI analyses that will incorporate a broader set of predictors beyond those already known to be associated with CKD.
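
For readers less familiar with the performance measures reported in the Results (accuracy, F1-score, AUROC), the following minimal sketch shows how each is computed from true labels and predicted scores. The labels, scores, and the 0.5 decision threshold are hypothetical illustrations, not data or settings from this study.

```python
def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (CKD) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """Probability that a random case scores higher than a random control.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = CKD case, 0 = control) and model scores.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.2, 0.7, 0.6, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("accuracy:", accuracy(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUROC:", auroc(y_true, scores))
```

Accuracy and F1-score depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is one reason the three can differ for the same model.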

Conclusions We describe a first-of-its-kind application of XAI to better understand the dynamics underlying CKD risk in SCD patients. In the immediate future, we will expand our analysis to include all available features in the EHR, allowing for the discovery of entirely new CKD risk factors among SCD patients that can improve clinical guidelines, prevention strategies, and early intervention.
